Introduction


polymorphism, n. (1): capability
of assuming different forms; capability
of wide variation.

-Webster's  Third  International  Dictionary- 

When  von  Neumann  computers  were  still  new  and  exciting, 
scientists  noted  in  popular  accounts  that  unlike  mechanical  machines, 
computers  are  polymorphic  -  their  function  can  be  radically  changed 
simply  by  changing  programs.  Polymorphism  is  fundamental,  but 
it  quickly  became  familiar  to  the  point  of  being  obvious  and  has  been 
mentioned  little  since,  even  though  it  has  continued  to  underlie 
important  advances  such  as  time-sharing  and  programmable  microcode. 

Now, as we are confronted with the potential for highly parallel
computers made possible by very large scale integrated (VLSI) circuit
technology, we may ask:

What  is  the  role  of  polymorphism  in  parallel  computation? 

To  answer  this  question,  we  must  review  the  characteristics  of  parallel 
processing  and  the  benefits  and  limitations  of  VLSI  technology. 
Algorithmically Specialized Processors
Perhaps  the  most  important  property  of  VLSI  circuit  technology  is 
that  the  manufacturing  processes  use  photolithographic  means  to  create 
copies  of  a  circuit.  Fabrication  by  photolithography  (or  the  newer 
techniques  such  as  electron  beam  lithography)  requires  a  fixed  number 
of  steps  to  produce  a  circuit,  independent  of  the  circuit's  complexity. 

It  costs  no  more  to  make  copies  of  a  chip  containing  a  NAND  gate  than 
to  make  copies  of  a  chip  containing  a  microprocessor,  although  yields  will 
likely  be  higher  for  the  former  and  wire  bonding  costs  higher  for  the 
latter.  Preparing  and  debugging  the  lithographic  masks  is  expensive, 
so  the  technology  favors  parallel  processing  techniques  that  employ 
many  copies  of  the  same,  possibly  complex  circuit. 

Recognition  of  uniformity  as  the  source  of  leverage  in  VLSI  caused 
a  flurry  of  research  during  the  past  half  decade.  This  research  resulted 
in  a  number  of  device  proposals  which  we  may  call  algorithmically 
specialized  processors.  By  focusing  on  computationally  intensive 
problems  and  carefully  dissecting  algorithms  for  them,  researchers  have 
developed  algorithmically  specialized  processors  having  several  important 
characteristics:

.  construction  is  based  on  a  few  easily  tessellated  processing 
elements, 

.  locality  is  exploited,  that  is,  data  movement  is  often  limited 
to  adjacent  processing  elements, 

.  pipelining  is  used  to  achieve  high  processor  utilization. 

Examples of algorithmically specialized processors include designs for LU
decomposition [2,3] (the main step in solving systems of linear equations),
the solution of linear recurrences [2], tree processors [4,5,6] (used in
searching, sorting and expression evaluation), dynamic programming [7]
(a general problem solving technique with numerous applications), join
processing [8] (for data base querying), and many others.

Algorithmically  specialized  processing  components  must  usually  be 
joined  together  to  solve  a  large  computationally  intensive  problem. 

This  composition  step  is  crucial  since  whole  problems  tend  to  be 
multiphased  and  these  components  tend  to  be  specialized  to  an  algorithm 
used in only one phase. For example, to solve a system of linear
equations (Ax = b) one might use a processor component to form the LU
decomposition of the matrix A (A = LU) and then use a linear recurrence
solver component to perform the substitution phases (Ly = b and Ux = y).

As  another  example,  queries  in  data  base  query  languages  are  formed 
by  composing  operations  such  as  "search"  and  "join". 
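The multiphase structure of the linear-system example can be made concrete with a small sequential sketch. The Python below is only an illustration of the three phases (factor A = LU, solve Ly = b, solve Ux = y); it is not the systolic hardware algorithm, it assumes no pivoting is needed, and the function names are hypothetical.

```python
def lu_decompose(a):
    """Doolittle factorization without pivoting: returns (L, U) with A = L U."""
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Row i of U.
        for j in range(i, n):
            upper[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j] for k in range(i))
        # Column i of L (unit diagonal is already in place).
        for j in range(i + 1, n):
            partial = sum(lower[j][k] * upper[k][i] for k in range(i))
            lower[j][i] = (a[j][i] - partial) / upper[i][i]
    return lower, upper


def forward_substitute(lower, b):
    """Solve L y = b for lower triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(lower[i][k] * y[k] for k in range(i))) / lower[i][i])
    return y


def back_substitute(upper, y):
    """Solve U x = y for upper triangular U."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(upper[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - tail) / upper[i][i]
    return x


def solve(a, b):
    """Compose the phases: decomposition, then the two substitutions."""
    lower, upper = lu_decompose(a)
    return back_substitute(upper, forward_substitute(lower, b))
```

Each function corresponds to one phase, which is the unit that a separate specialized component would implement.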

If  the  component  processors  are  implemented  on  chips,  one  way  to 
compose  them  is  to  wire  them  together.  This  solution  is  inflexible  since 
the  components  are  dedicated  to  a  particular  problem  and  cannot  be  used 
for  another  problem.  Another  compositional  scheme  is  to  join  the 
processors to a bus as "peripherals." This is more flexible since a
processor  can  be  used  in  different  phases,  but  the  bus  becomes  a 
bottleneck  and  time  is  wasted  in  interphase  data  movement. 

A  more  flexible  approach  is  to  replace  the  dedicated  processing 
elements  with  more  general  microprocessors  and  simply  to  program  the 
algorithmically  specialized  processors.  This  solution  is  much  more 
flexible  since  different  components  can  use  the  same  devices  by  changing 
programs (provided the interconnection pattern is the same). The bus
bottleneck is eliminated. There is a loss in performance with this
polymorphism, since circuit implementation of the primitive actions is
replaced by the slower process of instruction execution.

But the main problem with this approach is that not all algorithmically
specialized processors use the same interconnection structure (see Figure 1).
There is no guarantee that the consecutive phases of the computation can
be done efficiently in place. For example, if we have an $n \times n$ mesh
connected microprocessor structure and want to find the maximum of $n^2$
elements stored one per processor, $n$ steps are necessary and sufficient
to solve the problem. But a faster algorithmically specialized processor
for this problem uses a tree interconnection pattern to find the solution
in $2 \log n$ steps. For large $n$ this is a benefit worth seeking. Again,
a bus can be introduced to link several differently connected multiprocessors
including mesh and tree connected multiprocessors. Data could be transferred
when a change in the processor structure would be beneficial. But the
bottleneck is quite serious - in the example, data has to be transferred
at $n^2/\log n$ words per step to make the transfer worthwhile. What we need
is a multiprocessor with more polymorphism that does not compromise the
benefits of VLSI technology.
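The gap between the two interconnection patterns can be seen by counting communication steps. The sketch below assumes an n x n mesh holding one value per PE and a binary tree over the same values. The sweep strategy and the function names are illustrative assumptions, not taken from the cited designs. The simple row-then-column sweep shown here uses 2(n - 1) nearest-neighbor steps, which grows linearly in n, while the tree needs one step per level.

```python
def mesh_max_steps(grid):
    """Maximum of an n x n grid using only nearest-neighbor moves.

    Every row is swept toward column 0 (all rows move in parallel, so each
    column shift counts as one step), then column 0 is swept toward row 0.
    """
    n = len(grid)
    cells = [row[:] for row in grid]
    steps = 0
    for col in range(n - 1, 0, -1):
        for row in range(n):
            cells[row][col - 1] = max(cells[row][col - 1], cells[row][col])
        steps += 1
    for row in range(n - 1, 0, -1):
        cells[row - 1][0] = max(cells[row - 1][0], cells[row][0])
        steps += 1
    return cells[0][0], steps


def tree_max_steps(values):
    """Maximum by pairwise combination up a binary tree, one level per step."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        level = [max(level[i:i + 2]) for i in range(0, len(level), 2)]
        steps += 1
    return level[0], steps
```

For n = 8 the mesh sweep takes 14 steps, whereas the tree over the same 64 values takes 6, that is, 2 log2 n.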

The Configurable, Highly Parallel (CHiP) computer is a multiprocessor
architecture that provides a programmable interconnection structure
integrated with the processing elements. Its objective is to provide the
flexibility  needed  to  compose  general  problem  solutions  while  retaining 
the  benefits  of  uniformity  and  locality  that  the  algorithmically 
specialized  processors  exploit. 

The  CHiP  Architecture  Overview 

The CHiP computer is a family of architectures, each constructed from
three components: (a) a collection of homogeneous microprocessors,
(b) a switch lattice, and (c) a controller. The switch lattice is the
most important component and the main source of differences among family
members.

Figure 1. Interconnection patterns for algorithmically specialized
processors: (a) mesh, used for dynamic programming [7]; (b) hexagonally
connected mesh, used for LU decomposition [2]; (c) torus, used for
transitive closure [7]; (d) binary tree, used for sorting [4]; (e) double
tree, used for searching [5].

The switch lattice is a regular structure formed from programmable
switches connected by data paths. The microprocessors (hereafter called
PEs) are not directly connected to each other, but rather are connected
at regular intervals to the switch lattice. Figure 2 shows three
examples of switch lattices. Generally, the layout will be square
although other geometries are possible. The perimeter switches are
connected to external storage devices. A production CHiP computer might
have $2^8$ to $2^{16}$ PEs. (With current technology only a few PEs and switches
can be placed on a single chip. As improvements in fabrication technology
permit higher device densities per unit area, a single chip can host a
larger region of the switch lattice. Moreover, as discussed below, the
CHiP architecture is quite suitable for "wafer level" fabrication.)

Each  switch  in  the  lattice  contains  local  memory  capable  of  storing 
several  configuration  settings.  A  configuration  setting  enables  the 
switch  to  establish  a  direct,  static  connection  among  two  or  more  of  its 
incident  data  paths.  (Notice,  this  is  circuit  switching  rather  than 
packet  switching.)  For  example,  we  achieve  a  mesh  interconnection 
pattern  of  the  PEs  for  the  lattice  in  Figure  2(a)  by  assigning  North-South 
configuration  settings  to  alternate  switches  in  odd  numbered  rows  and 
East-West  settings  to  switches  in  the  even  rows.  Figure  3  illustrates 
the  configuration;  Figure  4  gives  the  configuration  settings  of  a  binary 
tree. 


Figure 2. Three examples of switch lattices.
The controller is responsible for loading the switch memory. (This
task is performed via a separate interconnection "skeleton" that is
transparent to this discussion.) The switch memory is loaded
preparatory to processing, in parallel with the PE program
memory loading. Typically, program and switch settings for several
phases  can  be  loaded  together.  The  chief  requirement  is  that  the  local 
configuration  settings  for  each  phase's  interconnection  pattern  be 
assigned  to  the  same  memory  location  in  all  switches.  For  example,  in 
each  switch,  location  1  might  be  used  to  store  the  local  configuration 
to  implement  a  mesh  pattern,  location  2  might  store  the  local 
configuration  for  the  tree  interconnection  pattern,  etc. 

CHiP processing begins with the controller broadcasting a command
to  all  switches  to  invoke  a  particular  configuration  setting.  For 
example,  suppose  it  is  the  setting  stored  at  location  1  that  implements 
a  mesh  pattern.  With  the  entire  structure  interconnected  into  a  mesh, 
the  individual  PEs  synchronously  execute  the  instructions  stored  in 
their  local  memory.  PEs  need  not  know  to  whom  they  are  connected;  they 
simply  execute  instructions  such  as  READ  EAST,  WRITE  NORTH  WEST,  etc. 

The  configuration  remains  static.  When  a  new  phase  of  processing  is  to 
begin,  the  controller  broadcasts  a  command  to  all  switches  to  invoke  a 
new  configuration  setting,  say  the  one  stored  at  location  2  implementing 
a  tree.  With  the  lattice  restructured  into  a  tree  interconnection  pattern, 
the  PEs  resume  processing,  having  spent  only  a  single  logical  step  in 
interphase  structure  reconfiguration. 
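The interplay of switch memory and controller broadcast can be modeled in a few lines. The following Python is a minimal sketch under stated assumptions: the class names, the port labels ("N", "E", "S", "W") and the representation of a setting as groups of connected ports are all hypothetical and chosen only to mirror the description above.

```python
class Switch:
    """One lattice switch with local memory for a few configuration settings.

    A setting is a list of groups; each group is a set of port names that
    are statically connected to one another (circuit switching).
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = {}      # location -> list of frozensets of ports
        self.active = []      # the setting currently in effect

    def load(self, location, setting):
        if location not in self.memory and len(self.memory) >= self.capacity:
            raise ValueError("switch memory is full")
        self.memory[location] = [frozenset(group) for group in setting]

    def invoke(self, location):
        self.active = self.memory[location]

    def connected(self, port_a, port_b):
        return any(port_a in g and port_b in g for g in self.active)


class Controller:
    """Loads settings ahead of time and switches phases with one broadcast."""

    def __init__(self, switches):
        self.switches = switches

    def broadcast(self, location):
        # Every switch invokes the setting stored at the same location.
        for switch in self.switches:
            switch.invoke(location)
```

A usage example: load location 1 of every switch with its part of a mesh and location 2 with its part of a tree, then call `broadcast(1)` for the first phase and `broadcast(2)` for the next. The PEs themselves are unaffected; only the paths behind their ports change.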

The  overview  of  the  CHiP  computer  family  has  been  superficial,  but 
it  has  provided  a  context  in  which  to  present  a  more  thorough  treatment. 

(A comparison of the CHiP architecture with other interconnection methods
is given in reference [12].)

The next three sections are:

A closer look, giving details about switches, lattices and
the controller,

Embedding an interconnection structure, an example of how to
configure the lattice into a complete binary tree, and

Solving a system of linear equations, illustrating how a
multiphased problem might be solved.

We conclude with a Discussion section in which we mention some of the
consequences of the CHiP architecture approach.

A  Closer  Look 

We  review  some  of  the  characteristics  that  distinguish  members  of  the 
family  of  CHiP  computers. 

Switches.  It  is  convenient  to  think  of  switches  as  being  defined  by 
several  parameters. 

m  -  the  number  of  wires  entering  a  switch  on  one  data  path,  or  data 
path  width, 

d  -  the  degree,  or  number  of  incident  data  paths, 

c  -  the  number  of  configuration  settings  that  can  be  stored  in  a 
switch. 

The  value  of  m  reflects  the  balance  struck  between  parallel  and  serial 
data  transmission.  This  balance  will  be  influenced  by  several  considerations, 
one  of  which  is  the  limited  number  of  pins  on  the  package  containing  the 
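chips of the CHiP lattice. Specifically, if a chip hosts a square region
of the lattice containing n PEs, then the number of pins required is
proportional to $m\sqrt{n}$, since the data paths leaving the region cross
its perimeter, whose length grows as $\sqrt{n}$.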

The  value  of  d  will  usually  be  4,  as  in  Figure  2(a),  or  8,  as 
in  Figure  2(c).  Figure  2(b)  shows  a  mixed  strategy  which  exploits 
the  fact  that  switches  tend  to  be  used  in  two  different  roles.  Switches 
at the intersection of the vertical and horizontal switch corridors tend
to perform most of the routing while those interposed between two
adjacent PEs act more like extended PE ports for selecting data paths
from the "corridor buses". Specializing the degree of the switch to
these activities reduces the number of bits required to specify a
configuration setting and thus saves area.

The value of c is influenced by the number of configurations that are
likely to be needed for a multiphase computation and the number of bits
required per setting. This latter number depends on the degree and the
crossover capability of the switch.

"Crossover  capability"  is  a  property  of  switches  referring  to  the 
number  of  distinct  data  path  groups  that  a  switch  can  simultaneously 
connect.  We  speak  of  data  path  "groups"  rather  than  data  path  pairs 
since  fanout  is  permitted  at  a  switch,  i.e.  a  switch  can  connect  more 
than  a  pair  of  data  paths.  Crossover  capability  is  specified  by  an 
integer  g  in  the  range  1  to  d/2,  i.e.  1  indicates  no  crossover  and 
d/2  is  the  maximum  number  of  distinct  paths  intersecting  at  a  degree  d 
switch.  Like  the  three  parameters  mentioned  above,  the  crossover 
capability  g  is  fixed  at  fabrication  time. 

The number of bits of storage needed for a switch is modest: dgc.
This provides a bit for each direction for each crossover group for each
configuration setting. A technique to reduce this value is to provide
for the loading of switch settings while the CHiP processor is executing.
This quality, called "asynchronous loading", permits a smaller value of c
by taking advantage of two facts: algorithms often use configurations that
differ in only a few places, and configurations often remain in effect
long enough to provide time to prepare for future settings.
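The storage estimate follows directly from the encoding it implies: one bit per direction, per crossover group, per stored setting. The Python below is a sketch of that bookkeeping only. The port ordering, the function names and the example parameter values are assumptions made for illustration.

```python
def switch_storage_bits(d, g, c):
    """Bits of local memory for a degree-d switch with crossover
    capability g that stores c configuration settings."""
    if not 1 <= g <= d // 2:
        raise ValueError("crossover capability must lie in 1..d/2")
    return d * g * c


def encode_setting(groups, ports, g):
    """Encode one setting as g bitmasks, each with one bit per port.

    Bit i of a mask is set when ports[i] belongs to that crossover group.
    Unused groups are left as all-zero masks.
    """
    if len(groups) > g:
        raise ValueError("more groups than the crossover capability allows")
    index = {port: i for i, port in enumerate(ports)}
    masks = [0] * g
    for slot, group in enumerate(groups):
        for port in group:
            masks[slot] |= 1 << index[port]
    return masks
```

For instance, a hypothetical degree-four switch with g = 2 and c = 8 needs 4 * 2 * 8 = 64 bits, and a setting that connects North to South and East to West occupies two 4-bit masks.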

Lattice.  From  Figure  2  it  is  clear  that  lattices  can  differ  in 
several characteristics. The PE degree, like the switch degree, is the
number of incident data paths. Most algorithms of interest use PEs of
degree  eight  or  less.  Larger  degrees  are  probably  not  necessary  since 
they  can  be  achieved  either  by  multiplexing  data  paths  or,  with  some 
loss  in  PE  utilization,  by  logically  coupling  processing  elements,  e.g. 
two  degree  four  PEs  could  be  coupled  to  form  a  degree  six  PE  where  one 
serves  only  as  a  buffer. 

Call the number of data paths that separate two adjacent PEs the
corridor width, w. (See Figure 2(c) for a w = 2 lattice.) This is
perhaps  the  most  significant  parameter  of  a  lattice  since  it  influences 
the  efficiency  of  PE  utilization,  the  convenience  of  interconnection 
pattern  embeddings,  and  the  overhead  required  for  the  polymorphism. 

To  see  the  impact  of  corridor  width,  let  us  embrace  graph  embedding 
parlance  and  say  that  a  switch  lattice  hosts  a  PE  interconnection  pattern. 
In  theory,  even  the  simplest  lattice  (like  the  one  in  Figure  2(a))  can 
host  an  arbitrary  interconnection  pattern.  But  to  do  so  may  require  the 
PEs to be underutilized for two reasons. First, PEs may be coupled to
achieve high PE degree as mentioned at the beginning of this section.
Second, and more importantly, adjacent PEs in the (logical) guest
interconnection pattern may have to be assigned to widely spaced PEs in the
hosting lattice (i.e. separated by unused PEs) in order to provide
sufficiently many data paths for the edges. (Figure 5 shows the embedding
of $K_{4,4}$ in the lattice of Figure 2(c) where the center column of PEs is
unused.) Increasing corridor width improves processor utilization when
complex interconnection patterns must be embedded since it provides more
data paths per unit area.

Figure 5. Graph $K_{4,4}$ shown in (a) is embedded into the lattice of
Figure 2(c) using a switch with crossover value g = 2.

How wide should corridors be? It all depends on which interconnection
patterns are likely to be hosted and how economically necessary it is to
maximize PE utilization. For most of the algorithmically specialized
processors developed for VLSI implementation, a corridor width of two
suffices to achieve optimal or near optimal PE utilization. However,
to be sure of hosting all planar interconnection patterns of n nodes with
reasonably complete processor utilization, a width proportional to $\log n$
suffices and may be necessary [citation illegible]. To host patterns such
as the shuffle-exchange graph with high efficiency will require still wider
corridors, which on the average must be at least proportional to
[expression illegible].

Selecting a corridor width is a difficult decision, especially if
it is a nonconstant width. The benefit is higher PE utilization in some
cases; the cost is a loss of some locality in all cases, introduction of
more area overhead, and increased problems with "pin" limitations.
Preliminary evidence indicates that [value illegible] provides a reasonable
cost/benefit trade off, but further experimentation and analysis are
required.

Embedding an Interconnection Structure

In addition to the conventional polymorphism derived from PE
programming, we have provided for a second kind of polymorphism - the
programmable switches. This requires us to provide for interconnection
pattern programming, i.e. the specification of a global interconnection
pattern. When viewed in a programming language context, the "source
program" is a global interconnection pattern that the compiler translates
into an "object code" of individual switch settings suitable for loading
into the switches by the CHiP controller. The general programming language
and compiler issues need not concern us here, however, for we will explore
only one particular interconnection pattern: the complete binary tree.

This example will enable us to illustrate the differences between
embedding into the plane and embedding into the CHiP lattice.

The complete binary tree has $2^k-1$ PEs, one at each node. One
possible layout of this structure in the CHiP lattice is a direct
translation of the "hyper-H" strategy [1] illustrated in Figure 1(d).
Figure 6 illustrates this embedding into the lattice of Figure 2(a) and
it is clear that a significant number (approaching one half) of the PEs
are unused in this naive approach. The problem is that although the
hyper-H is an excellent embedding on plain silicon where the placement
of PEs and data paths is arbitrary, CHiP lattice embeddings must conform
to the prespecified PE and data path sites. As we shall see, this
constraint is not onerous.

To illustrate an optimal embedding (in terms of maximizing the
use of PEs), assume that we have an $n \times n$ CHiP lattice where $n = 2^k$
for some integer $k$. This gives $2^{2k}$ PEs, so a binary tree of depth $2k$
fits with only one unused PE, since it has $2^{2k}-1$ nodes. Call this
unused PE a "spare."

We proceed inductively by pairing two embedded subtrees to form
a new tree one level higher. For the basis of the induction it is
convenient to use a three node binary tree embedded with one spare in
a $2 \times 2$ portion of the lattice. Pairing square subtree embeddings
produces rectangles with sides in ratio 2:1. Pairing these rectangles
yields squares again. In general we pair two subtrees each with $2^{2k}-1$
nodes and a spare to produce a new $2^{2k+1}-1$ node tree in which one of the
subtrees' spares becomes the root of the new tree and the other spare
becomes the spare of the new tree. The interesting problem is to place
the spares at the proper sites for the next step in the induction.
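The counting argument behind this induction is easy to check mechanically. The Python sketch below tracks only the bookkeeping (node count, spare count, and the shape of the occupied region). It does not compute where the spares sit, which is the harder placement problem discussed next. The tuple layout and the names are hypothetical.

```python
# An embedding is summarized as (tree_nodes, spares, rows, cols).
BASIS = (3, 1, 2, 2)   # three node tree plus one spare in a 2 x 2 region


def pair_embeddings(embedding):
    """Pair two identical embeddings into a tree one level higher."""
    nodes, spares, rows, cols = embedding
    # One spare is promoted to the new root; the other stays a spare.
    new_nodes = 2 * nodes + 1
    new_spares = 2 * spares - 1
    if rows == cols:
        # Two squares side by side give a 2:1 rectangle.
        new_rows, new_cols = rows, 2 * cols
    else:
        # Two 2:1 rectangles stacked along the long side give a square.
        new_rows, new_cols = 2 * rows, cols
    return new_nodes, new_spares, new_rows, new_cols


def grow(pairings):
    """Apply the pairing step repeatedly, starting from the basis tree."""
    embedding = BASIS
    for _ in range(pairings):
        embedding = pair_embeddings(embedding)
    return embedding
```

Six pairings from the basis give a 255 node tree with one spare in a 16 x 16 region, so every PE but one is used at every stage.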


Figure 6. The hyper-H tree (Figure 1(d)) embedded into the switch
lattice of Figure 2(a); the switches are not shown.

If we adopt the strategy of the hyper-H embedding and locate the
root at the center of the tree, then it makes sense to place a spare at
the middle of one side so that when this tree is paired to form the next
larger tree, there is a spare at the interface ready to become the new
root. This will be in the center of the new tree as we intend. (Of
course, since the sides always have an even number of PEs, "middle"
here means adjacent to the midpoint of one side.) But we cannot
pair two trees with their spares in the middle of one side since this
will leave us with either a buried spare that is useless for forming
the next larger tree or a spare on the perimeter
at a site inappropriate for the embedding of the next larger tree.
(See Figure 7.)

The solution is to pair one subtree with a spare located at the
middle of one side with a subtree whose spare is at the corner. The
spare in the middle becomes the root of the new tree and the corner spare
can be located (using reflection) to become either a middle spare or a
corner spare of the new tree depending on which is needed for the next
inductive step. Thus, at each step in the induction we must use (and
we can create) two types of embeddings: middles and corners. (See
Figure 8.) Notice that the basis tree, embedded in a 2 x 2 portion of
the lattice, actually serves as both types.

Figure 7. Pairing subtrees using spares located at the
midpoint of one side.

Trees, of course, are planar; that is, they can be embedded in the
plane without crossovers. But if the reader endeavors to follow the
preceding algorithm with the lattice in Figure 2(a), it will appear as
though crossovers are required, at least during the early stages of the
embedding. It is possible, using basis elements of fifteen node trees
embedded in 4 x 4 square regions of the lattice, to achieve a completely
planar embedding. A solution is shown in Figure 9.

Figure 8. The formation of "middles" and "corners" embeddings
using a middle and corner pair.

Figure 9. Planar embedding of a 255 node complete binary
tree into the lattice of Figure 2(a).

Solving a System of Linear Equations

In order to illustrate how the CHiP processor can be used to compose
algorithms, we pose the problem of solving a system of linear equations,
i.e. to solve Ax = b for an n x n coefficient matrix A of bandwidth w
and an n vector b. We shall use two algorithmically specialized processors
due to H.T. Kung and C.E. Leiserson as described in Mead and Conway [1].
The first is an LU-decomposition systolic array processor that factors A
into upper and lower triangular matrices U and L.
The second systolic processor solves a lower triangular linear system
Ly = b where L is the output from the decomposition step. (We call this
the LTS solver.) The final result vector x can be found by solving
Ux = y where U is the upper triangular matrix from the first step and y
is the vector output of the second step. By rewriting U as a lower
triangular system we can reuse the LTS solver. Our approach will be to
compose these pieces into a harmonious process to solve the entire
problem.

The first problem we must solve is the embedding of the Kung-Leiserson
systolic processors. These algorithmically specialized processors are
defined for n x n arrays of bandwidth w. (Figure 10 shows the
LU-decomposition processor for a w = 7 system. Figure 11 shows a suitable
w = 4 lower triangular system solver processor.) Since the LU-decomposition
processor is hexagonally connected, it will be convenient to embed the
processors into the lattice shown in Figure 2(b). The obvious strategy
is to connect the processors in such a way that the lower triangular
output L of the decomposition step connects directly to the input of
the lower triangular system solver. It is also obvious that these
embeddings should be placed at the perimeter of the CHiP lattice so that
matrix A and vector b can be received from external storage. Figure 12
shows such an embedding* where the PE labellings correspond to those
given in Figures 10 and 11.
Figure 10. The Kung-Leiserson systolic array for LU-decomposition.
Labellings indicate data paths. For timings, see reference [1].

* Although the data paths are bidirectional, we have used arrows to
emphasize the direction of data movement.

Figure 11. The Kung-Leiserson systolic LTS solver for w = 4. Labellings
indicate data paths for elements of L and b. For timings, see
reference [1].

Figure 12. The embedding of the LU-decomposition processor and
the LTS solver in the lattice of Figure 2(b). PE labellings
correspond to Figures 10 and 11.

Several simple transformations have been employed to accomplish
the embedding. The most noticeable is that the hexagonal structure has
been slightly deformed to accommodate the rectangular CHiP lattice and
the LU-decomposition processor has been rotated clockwise 120°. The
constant inputs (0's and -1) that appear on the perimeter of the systolic
array have been suppressed since they can be generated internally to the
PEs. The output wires carrying the L matrix result have been assigned
to one of the available ports and routed to the inputs of the LTS solver.
Finally, to embed the double channel between PEs of the LTS solver we
have routed data diagonally out of the North-East port into the South-East
port. Notice that since the diagonal elements of L are all 1, they are not
explicitly produced.

The next problem to solve is the rewriting of U as a lower
triangular system suitable for input into another embedded LTS solver.
We must wait until U has been entirely produced before performing this
operation.  So,  rather  than  writing  the  elements  of  U  to  external  storage 
as  they  are  produced,  we  thread  them  through  the  lattice  (assuming  there 
is  sufficient  space  to  store  them  all).  We  also  thread  the  y  vector 
output  from  the  LTS  process  along  with  U.  Then  in  the  second  phase  of 
our  algorithm,  we  can  process  the  elements  through  another  embedded  LTS 
solver. 

Perhaps the most elegant way to thread U and y through the lattice
is to use a graph embedding due to Aleliunas and Rosenberg [13]. The
scheme has the advantage of not requiring a large "bundle" of wires along
the perimeter of the lattice when the threads double back. (Figure 13
illustrates the embedding required for doubling back.) As the U and y
values are produced, they are passed from PE to PE. (They could be
"concentrated" by storing several per PE.) When U and y are completely
produced, the first phase is completed.


Figure 13. The Aleliunas-Rosenberg embedding of the threads
doubling back.


Between  the  first  and  second  phases  we  make  a  minor  reconfiguration. 
(This  reconfiguration  would  not  have  been  necessary  had  the  phase  1 
configuration  been  somewhat  more  clever;  but  as  an  example,  it  would  also 
have  been  somewhat  more  confusing.)  The  second  configuration  embeds  the 
LTS  solver  into  the  fourth  row  of  processors  as  illustrated  in  Figure  14. 


Figure 14. The simple phase 2 embedding.

The inputs to this group of processors come from reversing the direction
of flow of the threaded values from phase 1. Notice that this reversal
of flow has the effect of renumbering the matrix U to be in lower
triangular form appropriate for the LTS solver. The appropriate values
of the y vector are also available at the proper locations. The outputs
from the second phase emanate from the western port of processor (4,1).
These are the values solving Ax = b.
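Why reversing the flow lets the LTS solver be reused can be seen algebraically: reversing both the row order and the column order of an upper triangular matrix yields a lower triangular one, and the unknowns then come out in reverse order. The Python below is a sequential sketch of that renumbering with hypothetical function names. It says nothing about the timing of the systolic array.

```python
def forward_substitute(lower, b):
    """Solve L z = b for lower triangular L (the job of an LTS solver)."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i])
    return z


def reverse_system(upper, y):
    """Renumber U x = y by reversing the order of rows, columns and y."""
    flipped = [row[::-1] for row in upper[::-1]]
    return flipped, y[::-1]


def solve_upper_with_lts(upper, y):
    """Solve U x = y using only the lower triangular solver."""
    flipped, y_reversed = reverse_system(upper, y)
    z = forward_substitute(flipped, y_reversed)
    return z[::-1]   # undo the renumbering of the unknowns
```

For U = [[2, 1], [0, 4]] and y = [4, 8], the reversed system is [[4, 0], [1, 2]] with right-hand side [8, 4], and the solver returns x = [1, 2].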

To  summarize,  the  system  of  linear  equations  Ax  =  b  is  solved  in  two 
phases  on  the  CHiP  processor.  In  phase  1  an  embedded  LU-decomposition 
processor  takes  A  as  input  and  produces  matrices  L  and  U  as  output.  The 
L  output  is  immediately  input  to  an  LTS  solver  that  also  takes  b  as  input 
and solves Ly = b. The vector y and the matrix U are threaded through the
lattice.  Phase  1  completes  when  A  has  been  decomposed.  In  phase  2 
another  embedded  LTS  solver  takes  the  threaded  output  from  phase  1  (by 
reversing  its  flow)  and  solves  Ux  =  y. 
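
The two phases can be sketched sequentially as follows. This is only an
illustration of the arithmetic, not of the pipelined CHiP embedding: the
function names (lu_decompose, lts_solve, reverse_numbering, solve) are
hypothetical, the LU-decomposition is assumed to need no pivoting, and
exact rational arithmetic is used for clarity. The point of interest is
that a single lower triangular solver serves both phases once U has been
renumbered in reverse order.

```python
from fractions import Fraction


def lu_decompose(a):
    """LU-decomposition without pivoting (assumed safe for the input).

    Returns (l, u) with l unit lower triangular and u upper triangular.
    """
    n = len(a)
    u = [[Fraction(x) for x in row] for row in a]
    l = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    return l, u


def lts_solve(t, b):
    """Solve the lower triangular system t x = b by forward substitution."""
    x = []
    for i, row in enumerate(t):
        partial = sum(row[j] * x[j] for j in range(i))
        x.append((Fraction(b[i]) - partial) / row[i])
    return x


def reverse_numbering(u):
    """Reverse row and column order: upper triangular becomes lower."""
    return [row[::-1] for row in u[::-1]]


def solve(a, b):
    l, u = lu_decompose(a)
    y = lts_solve(l, b)  # phase 1: Ly = b
    # Phase 2: Ux = y, fed to the same solver with the flow reversed.
    x_reversed = lts_solve(reverse_numbering(u), y[::-1])
    return x_reversed[::-1]


a = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
b = [4, 10, 24]
x = solve(a, b)
```

Reversing both the rows and the columns of U, and the order of y, is the
sequential counterpart of reversing the flow of the threaded values.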

Phase  2  makes  scant  use  of  parallelism  -  it  runs  in  the  same  time  as 
phase  1  and  the  data  are  already  in  the  CHiP  processor.  And  as  noted,  the 
interphase reconfiguration was not essential. But there are algorithms
to solve the phase 2 problem that do make essential use of configurability
to exploit parallelism effectively [14]. A complete development of the
approach is not possible here, but the essential idea, due to Chen, Kuck
and Sameh [11], is straightforward: a transformation on U enables us to
decompose the matrix into blocks whose product yields the result.

Because the product operation is associative, the whole product can be
formed by taking pairwise products in parallel, then pairwise products
of the results, etc. By reconfiguring the threaded portion of the lattice
using one of several rather complicated interconnection patterns that
either implicitly or explicitly embed a tree, we can perform these pairwise
products in parallel. The result is a faster parallel algorithm made
possible by configurability.
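
A minimal sketch of the tree-structured product follows. It assumes one
well-known product form for the inverse of a triangular matrix, which may
differ in detail from the transformation of [11]: after scaling the rows
to give a unit diagonal, the inverse is a product of elementary factors,
one per column. The names matmul, tree_product and solve_by_tree_product
are hypothetical, and the levels of the tree are evaluated one after
another here, whereas on the lattice the products within a level would
proceed in parallel.

```python
from fractions import Fraction


def matmul(p, q):
    """Product of two square matrices of equal size."""
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def tree_product(factors):
    """Multiply the factors in order, combining adjacent pairs level by level.

    The products within one level are independent of one another, so each
    level corresponds to one parallel step.
    """
    level = list(factors)
    while len(level) > 1:
        nxt = [matmul(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd factor carried up unchanged
        level = nxt
    return level[0]


def solve_by_tree_product(t, b):
    """Solve t x = b for lower triangular t via a product of factors."""
    n = len(t)
    # Scale each row so that the diagonal becomes 1.
    scaled = [[Fraction(t[i][j]) / t[i][i] for j in range(n)]
              for i in range(n)]
    rhs = [Fraction(b[i]) / t[i][i] for i in range(n)]
    # Factor for column j: the identity with the negated subdiagonal part
    # of column j. The inverse is the product taken from the last column
    # down to the first.
    factors = []
    for j in reversed(range(n)):
        m = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
        for i in range(j + 1, n):
            m[i][j] = -scaled[i][j]
        factors.append(m)
    inverse = tree_product(factors)
    return [sum(inverse[i][k] * rhs[k] for k in range(n)) for i in range(n)]


t = [[2, 0, 0], [1, 1, 0], [3, 2, 4]]
x = solve_by_tree_product(t, [2, 3, 19])
```

With n factors the tree has about log2(n) levels, which is where the
improvement over the sequential substitution comes from.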


Discussion 

Several  characteristics  of  the  CHiP  approach  should  be  mentioned. 

First,  the  algorithmically  specialized  processors  translate  mutatis 
mutandis  to  programs  for  the  CHiP  computer.  Thus,  we  have  a  ready 
supply  of  algorithms  that  can  effectively  use  the  parallel  processor. 

Of  course,  all  of  these  algorithms  use  one  interconnection  structure, 
and  it  is  possible  that  improved  algorithms  might  be  found  that  exploit 
the  availability  of  multiple  interconnection  structures. 

Second,  configurability  provides  both  interphase  and  intraphase 
flexibility.  This  distinction,  though  not  very  clear-cut,  tends  to 
correlate  with  whether  or  not  pipelining  is  being  used.  If  a  problem  is 
solved  by  a  sequence  of  phases  that  each  complete  before  the  next  one 
begins,  we  tend  to  use  regular  configurations  that  change  at  the  completion 
of  a  phase  (interphase).  The  whole  lattice  is  in  a  mesh  or  tree  pattern. 
For  a  series  of  pipelined  algorithms  that  can  be  coupled  together,  as  in 
the last section, we tend to form regions of the lattice dedicated to each
algorithm  with  data  paths  interconnecting  the  regions.  We  refer  to  this  as 
intraphase  configurability  because  within  one  phase  we  interconnect 
several  regular  structures.  Clearly,  we  need  not  change  configurations 
to  exploit  the  advantage  of  configurability. 

Both  kinds  of  configurability  are  useful  in  adapting  to  changes  in 
problem  size.  For  example,  two  different  small  problems  might  operate 
concurrently  on  different  regions  of  the  CHiP  processor  using  entirely 
different interconnection schemes. One pattern could change while the
other remained fixed by loading switches of the fixed region with two
copies of the same configuration setting. Pipelined processors, whose
size is usually a function of the input width, can be tailored to the
right size at loading time.
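
The device of loading two copies of one setting can be sketched as
follows. The sketch assumes that each switch stores a small table of
configuration settings and that a single broadcast command selects the
active entry in every switch at once; the Switch class, the broadcast
function and the setting names are hypothetical and serve only to
illustrate the idea.

```python
class Switch:
    """Hypothetical switch with a small table of configuration settings."""

    def __init__(self, settings):
        self.settings = list(settings)
        self.active = 0  # index of the setting currently in effect

    def select(self, slot):
        self.active = slot

    @property
    def connection(self):
        return self.settings[self.active]


def broadcast(switches, slot):
    """Model one global command: every switch selects the same slot."""
    for switch in switches:
        switch.select(slot)


# Region A changes pattern between slots; region B holds two copies of
# the same setting, so the global command leaves it undisturbed.
region_a = [Switch(["mesh", "tree"]) for _ in range(4)]
region_b = [Switch(["mesh", "mesh"]) for _ in range(4)]

broadcast(region_a + region_b, 1)
pattern_a = {s.connection for s in region_a}
pattern_b = {s.connection for s in region_b}
```

After the broadcast, region A is in the tree pattern while region B
remains a mesh, although both regions received the same command.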

Another  consequence  of  configurability  is  that  it  is  quite  fault 
tolerant. Supposing that an error is detected in a processor, data path
or switch, we can simply route around the offending device. For convenience,
we might choose to leave other processors unused to "square up" the
lattice when matching dimensions are important.

Perhaps  the  most  intriguing  consequence  of  configurability's  fault 
tolerance  is  the  possibility  of  "wafer  level"  fabrication.  That  is, 
instead of dicing a wafer and discarding the faulty processor chips, we
can leave a VLSI wafer whole and simply route around the unusable
processors. (We could use the dicing corridors for data paths and
switches.) For example, if a wafer contains 100 processor chips and
yield characteristics indicate that roughly one third are faulty, then
a wafer is acceptable if we can find an 8 × 8 sublattice that is functional.
The mapping of the switches to host the 8 × 8 in the 100 could be done
on the wafer by special circuitry designed for that purpose. Although the
number of pins required for the wafer would be large, their number is only
proportional to the perimeter rather than the area. This actually reduces
the number of wires bonded.

Summary

By integrating programmable switches with the processing elements,
the CHiP computer achieves a polymorphism of interconnection structure
that also preserves locality. This enables us to compose algorithms that
exploit different interconnection patterns. In addition to responding
to  different  problem  sizes  and  characteristics,  the  flexibility  of 
integrated  switches  provides  substantial  fault  tolerance  and  permits 
wafer  level  fabrication. 